1 Overview

This document illustrates the application of the three-stage methodology using national health care (HC) or mortality data to predict cancer incidence (CI) in geographical areas without a cancer registry. The method and the R code used are illustrated for two cancer sites, pancreatic cancer in men and melanoma in women, over the 2007-2015 period. This document details the different steps of calibration, validation and post-processing of the data. Its organisation follows the method section of the article.

Note that, in order to comply with legal constraints on data dissemination, the numbers of cancer and HC cases in the datasets were slightly altered; the results obtained here therefore differ slightly from those presented in the article.

1.1 R-package

The material and R code used in this application are part of the R package CalibInc, available on GitHub. Use the devtools package to install CalibInc:

require(devtools)
install_github("echatignoux/CalibInc")

After installation, the package can be loaded into R.

library(CalibInc)

The tidyverse, magrittr, mgcv and spdep packages are also needed to run the following code. The spatial smoothing is done with the R-INLA package (see http://www.r-inla.org for installation instructions).

library(tidyverse)
library(magrittr)
library(mgcv)
library(spdep)
library(INLA)

1.2 Data sources

The application of the method is illustrated here for two cancer sites, pancreatic cancer in men and melanoma in women, for the 2007-2015 period.

To evaluate the calibration model, incidence and HC data for districts covered by a cancer registry are needed. They are provided in the table dt_calib.

To predict incidence in all districts, we need HC data for all the French districts. These data are in the table dt_pred.

2 Calibration step

2.1 Overview of the data

The data set used for calibration and validation consists of, for each district covered by a cancer registry, the number of cancer cases (N_I) and the number of patients retrieved from HC data (N_HC), aggregated by cancer cause (site), district (dist), HC source (SRC) and 5-year age group (the age column gives the median of the age group). The additional columns py and pop_WHO refer respectively to the number of person-years (summed over the 2007-2015 period) and to the WHO 1960 standard population.

The structure of the data is presented below:

set.seed(200)
dt_calib%>%sample_n(5)
## # A tibble: 5 x 8
##   site           dist  SRC     age   N_I  N_HC      py pop_WHO
##   <fct>          <fct> <fct> <dbl> <int> <dbl>   <dbl>   <dbl>
## 1 Melanoma-women d_13  HA     67.5    48    48 154338.    3000
## 2 Pancreas-men   d_20  HA     32.5     0     0  72027     6000
## 3 Melanoma-women d_7   M      57.5   141    15 397739     4000
## 4 Pancreas-men   d_4   H      17.5     0     0 157488.    9000
## 5 Melanoma-women d_7   H      32.5    50    21 381150.    6000

2.2 Evaluation of the calibration model

2.2.1 Estimation

The calibration model (model 1 in the manuscript) is evaluated, for each cancer site and health-care proxy, with the gam function from the mgcv package. The log of the number of incident cancer cases is introduced as an offset in the model (rows with no incident cancer cases are thus not used to evaluate the calibration model). The age effect is introduced as a thin-plate spline (s(age,bs="tp")), with knots at the distinct observed values of age (knots=nk). The district effect is modelled as a random intercept (s(dist,bs="re")).

The model formula is specified as below:

nk<-list(age=unique(dt_calib$age)) 
form<-N_HC~offset(log(N_I))+s(age,bs="tp")+s(dist,bs="re") 

The models are then evaluated separately for each cancer site and proxy:

Pancreas - men

dt_panc<-dt_calib%>%filter(site=="Pancreas-men")%>%droplevels()
m_panc_A<-gam(form,family="quasipoisson",knots=nk,data=dt_panc%>%filter(SRC=="A",N_I>0),method="REML")
m_panc_H<-gam(form,family="quasipoisson",knots=nk,data=dt_panc%>%filter(SRC=="H",N_I>0),method="REML")
m_panc_HA<-gam(form,family="quasipoisson",knots=nk,data=dt_panc%>%filter(SRC=="HA",N_I>0),method="REML")
m_panc_M<-gam(form,family="quasipoisson",knots=nk,data=dt_panc%>%filter(SRC=="M",N_I>0),method="REML")

Melanoma - women

dt_mela<-dt_calib%>%filter(site=="Melanoma-women")%>%droplevels()
m_mela_A<-gam(form,family="quasipoisson",knots=nk,data=dt_mela%>%filter(SRC=="A",N_I>0),method="REML")
m_mela_H<-gam(form,family="quasipoisson",knots=nk,data=dt_mela%>%filter(SRC=="H",N_I>0),method="REML")
m_mela_HA<-gam(form,family="quasipoisson",knots=nk,data=dt_mela%>%filter(SRC=="HA",N_I>0),method="REML")
m_mela_M<-gam(form,family="quasipoisson",knots=nk,data=dt_mela%>%filter(SRC=="M",N_I>0),method="REML")

2.2.2 Model checking

The quality of fit of the models can be examined with the gam.check function. For the melanoma model with the A proxy, for example, no influential observations or violations of the model assumptions are apparent.

gam.check(m_mela_A)

Similar results apply to the other models.

2.2.3 HC/I ratios

The shape of the ratios may be visualized with the plot.gam function, or obtained as predictions with predict.gam.

Using predict.gam, we may superimpose the modelled mean HC/I ratio (red curve) on the ratios observed in the districts covered by registries (grey points and lines):

Code for graphs of HC/I ratios - pancreas, men

## Predictions with N_I=1 (i.e. mean ratio) for each proxy
dt_rat<-tibble(age=unique(dt_calib$age),N_I=1,dist="d_1")
dt_rat$rat_A<-predict(m_panc_A,exclude="s(dist)",newdata=dt_rat,type="response")
dt_rat$rat_H<-predict(m_panc_H,exclude="s(dist)",newdata=dt_rat,type="response")
dt_rat$rat_HA<-predict(m_panc_HA,exclude="s(dist)",newdata=dt_rat,type="response")
dt_rat$rat_M<-predict(m_panc_M,exclude="s(dist)",newdata=dt_rat,type="response")

## Reshaping to long format
dt_rat<-dt_rat%>%
  gather(SRC,rat,rat_A:rat_M)%>%
  mutate(SRC=gsub("rat_","",SRC))

## Graph
dt_panc%>%
  ggplot(data=.)+
  geom_jitter(aes(age,N_HC/N_I,group=dist,size=N_I),alpha=0.2,height = 0,width = 0.5)+
  geom_line(aes(age,N_HC/N_I,group=dist),alpha=0.2)+
  geom_line(data=dt_rat,aes(age,rat),colour="red",size=1)+
  coord_cartesian(ylim=c(0,2),xlim=c(20,90))+
  facet_wrap(~SRC)+
  ylab("HC/I")+xlab("Age")+
  theme(legend.position = "bottom")+labs(size="# I")
Figure 2.1: HC/I ratio by age - pancreas, men

Figure 2.2: HC/I ratio by age - melanoma, women

The graphs first show that the heights and shapes of the HC/I ratios by age vary between cancer sites and proxies. Second, the spread of the district-specific ratios around the mean ratio also varies between proxies and cancer sites. Setting aside the younger ages, for which the ratios are highly variable due to the small numbers of cases, the district ratios generally look more tightly clustered around the mean ratio for pancreatic cancer in men than for melanoma in women.

This visual impression is confirmed by the values of the standard deviation \(\sigma_d\) of the district random effects, which are obtained with the gam.vcomp function (row ‘s(dist)’):

gam.vcomp(m_mela_A)
## 
## Standard deviations and 0.95 confidence intervals:
## 
##             std.dev      lower       upper
## s(age)  0.003601549 0.00152346 0.008514276
## s(dist) 0.176109559 0.12110038 0.256106357
## scale   1.025679552 0.93756325 1.122077409
## 
## Rank: 3/3

Applying gam.vcomp to each model shows clearly that district deviations from the mean calibration curves are much lower for pancreatic cancer in men than for melanoma in women.

Table 2.1: Between-district variability (\(\sigma_d\)) of the HC/I ratios
Cancer site A H HA M
Pancreas-men 0.02 0.06 0.02 0.06
Melanoma-women 0.18 0.21 0.15 0.17

3 Predictions in a district

For each model, the predictions are obtained with the CalibInc function from the CalibInc package.

The code is illustrated below for the prediction of melanoma incidence in women, using the H proxy.

3.1 Number of incident cases

For each district, the total number of cancer cases predicted with the H proxy (and the associated standard errors) is obtained from the calibration model m_mela_H and the H data as follows:

dt_mela_H<-dt_calib%>%filter(site=="Melanoma-women",SRC=="H")
dt_pred_mela_H<-CalibInc(mod=m_mela_H,pred=dt_mela_H,aggregate=~dist)
dt_pred_mela_H
## # A tibble: 17 x 4
##    dist   N_HC  pred    se
##    <fct> <dbl> <dbl> <dbl>
##  1 d_1     360  630. 142. 
##  2 d_2     160  282.  66.4
##  3 d_3     427  729. 163. 
##  4 d_5     298  528. 120. 
##  5 d_7     560  986. 219. 
##  6 d_8     355  641. 145. 
##  7 d_9     792 1433. 316. 
##  8 d_10   1015 1819. 400. 
##  9 d_11    358  596. 134. 
## 10 d_12    256  445. 102. 
## 11 d_13    349  615. 139. 
## 12 d_15    185  323.  75.4
## 13 d_16    202  352.  81.7
## 14 d_17    176  310.  72.5
## 15 d_18    319  546. 124. 
## 16 d_19    185  318.  74.1
## 17 d_20    177  303.  70.9

The aggregate argument of the CalibInc function specifies the aggregation level of the predictions (here, the district level).

Prediction intervals at the 95% level may then be added to the results table with the LogNormPI function from CalibInc.

dt_pred_mela_H%>%LogNormPI()
## # A tibble: 17 x 6
##    dist   N_HC  pred    se   low    up
##    <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 d_1     360  630. 142.   418.  999.
##  2 d_2     160  282.  66.4  183.  456.
##  3 d_3     427  729. 163.   484. 1152.
##  4 d_5     298  528. 120.   349.  841.
##  5 d_7     560  986. 219.   657. 1553.
##  6 d_8     355  641. 145.   425. 1017.
##  7 d_9     792 1433. 316.   957. 2249.
##  8 d_10   1015 1819. 400.  1217. 2850.
##  9 d_11    358  596. 134.   395.  946.
## 10 d_12    256  445. 102.   293.  711.
## 11 d_13    349  615. 139.   407.  975.
## 12 d_15    185  323.  75.4  211.  521.
## 13 d_16    202  352.  81.7  231.  567.
## 14 d_17    176  310.  72.5  202.  500.
## 15 d_18    319  546. 124.   361.  868.
## 16 d_19    185  318.  74.1  208.  512.
## 17 d_20    177  303.  70.9  198.  489.
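The internals of LogNormPI belong to the CalibInc package, but assuming its bounds come from a log-normal approximation with mean pred and standard deviation se (the same parametrisation as in the funnel-plot code of section 4.1), they can be sketched in base R as follows; log_norm_pi is a hypothetical name, not the package function:

```r
## Sketch of 95% log-normal prediction bounds from a prediction (pred) and
## its standard error (se); an illustration only, not the CalibInc internals
log_norm_pi <- function(pred, se, level = 0.95) {
  z  <- qnorm(1 - (1 - level) / 2)
  cv <- se / pred               # coefficient of variation
  s  <- sqrt(log(cv^2 + 1))     # standard deviation on the log scale
  k  <- sqrt(cv^2 + 1)
  cbind(low = pred * k * exp(-z * s), up = pred * k * exp(z * s))
}
log_norm_pi(630, 142)  # roughly reproduces the d_1 interval above
```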

Similarly, the predicted numbers of incident cases by age and district may be obtained by modifying the aggregate argument:

CalibInc(mod=m_mela_H,pred=dt_mela_H,aggregate=~dist+age)%>%
  sample_n(5)%>%
  LogNormPI()
## # A tibble: 5 x 7
##   dist    age  N_HC  pred    se    low    up
##   <fct> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 d_2    37.5     6  13.6  6.87   6.00  38.8
## 2 d_12   37.5    10  22.7  9.41  11.3   53.6
## 3 d_5    37.5    18  40.9 14.0   22.5   82.8
## 4 d_20   52.5    12  25.2  9.87  12.9   56.8
## 5 d_20   22.5     0   0    0    NaN    NaN

3.2 Standardized incidence

Interest usually lies in standardized summaries of the incidence in the districts, such as the standardized incidence rate or the standardized incidence ratio.

3.2.1 Standardized incidence ratio

The standardized incidence ratio in a district \(j\) is calculated as the ratio of the predicted (\(\hat{I}_j=\sum_i \hat{I}_{ij}\)) to the expected total number of cases in the district: \(SIR_j=\frac{\hat{I}_j}{E_j}\).

The expected number of cases is obtained by applying the age-incidence rate in the reference population to the size of the district populations, i.e. \(E_j=\sum_i \frac{\hat{I}_{i.}}{py_{i.}}\times py_{ij}\), where \(\hat{I}_{i.}\) and \(py_{i.}\) refer respectively to the total number of predicted cancer cases and person-year of age \(i\) in the reference population.
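As a quick numerical check of these two formulas, here is a toy computation in base R (all numbers hypothetical, two age groups and a single district \(j\)):

```r
## Toy computation of E_j and SIR_j (hypothetical numbers)
I_ref  <- c(10, 40)       # predicted cases by age in the reference population
py_ref <- c(1e5, 2e5)     # person-years by age in the reference population
rate   <- I_ref / py_ref  # age-specific incidence rates I_i./py_i.

py_j  <- c(5e3, 1e4)      # person-years by age in district j
E_j   <- sum(rate * py_j) # expected number of cases in district j: 2.5
I_hat <- 4                # predicted total number of cases in district j
SIR_j <- I_hat / E_j      # 4 / 2.5 = 1.6
```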

To obtain standardized incidence ratios, we first need to calculate the incidence rate by age in the reference population. If this population is taken to be the districts covered by a cancer registry, for example, the following code calculates the SIR using the predicted incidence:

## 1- Calculate age incidence rate
## Total number of incident cases by age
dt_mela_H_rate<-CalibInc(mod=m_mela_H,pred=dt_mela_H,aggregate=~age)

## Add total number of py by age and calculate the rate
dt_mela_H_rate<-dt_mela_H_rate%>%
  left_join(
    dt_mela_H%>%
      group_by(age)%>%
      summarise(py=sum(py)))%>%
  mutate(rate=pred/py)

## 2- Calculate expected number of cases
dt_mela_H_E<-dt_mela_H%>%
  left_join(
    dt_mela_H_rate%>%select(age,rate)
  )%>%
  group_by(dist)%>%
  summarise(E=sum(py*rate))

## 3- Add expected counts to predicted counts and calculate the SIR
dt_SIR_mela_H<-dt_pred_mela_H%>%
  left_join(
    dt_mela_H_E)%>%
  mutate(SIR=pred/E)

3.2.2 Standardized incidence rate

Standardized incidence rates \(T_j\) are calculated in a given area \(j\) as the sum of incidence rate multiplied by the person year of a standard population \(py^r_{i}\) : \(\hat{T}_j=\sum_i \frac{\hat{I}_{ij}}{py_{ij}}\times py^r_{i}=\sum_i \hat{I}_{ij}\times \frac{py^r_{i}}{py_{ij}}\).
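As a toy numerical illustration of this formula (hypothetical numbers, two age groups, base R only):

```r
## Toy direct standardisation of an incidence rate (hypothetical numbers)
I_j  <- c(2, 8)        # predicted cases by age in district j
py_j <- c(1e4, 2e4)    # person-years by age in district j
py_r <- c(6000, 4000)  # standard population person-years (e.g. WHO)
T_j  <- sum(I_j / py_j * py_r)  # = sum(I_j * py_r / py_j) = 2.8
```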

The CalibInc function has a weight argument that can be used for this purpose. To calculate incidence rates standardized on the age structure of the WHO world standard population in each district covered by a cancer registry, one can use:

## Add weights to dt_mela_H table
dt_mela_H_T<-dt_mela_H%>%mutate(w=pop_WHO/py)
## Calculate standardized incidence rates in the districts using the weights
CalibInc(mod=m_mela_H,pred=dt_mela_H_T,weight=w,aggregate=~dist)
## # A tibble: 17 x 4
##    dist   N_HC  pred    se
##    <fct> <dbl> <dbl> <dbl>
##  1 d_1    6.36 12.3   2.81
##  2 d_2    6.00 11.6   2.83
##  3 d_3    7.95 15.2   3.47
##  4 d_5    7.41 14.3   3.30
##  5 d_7    5.16  9.96  2.23
##  6 d_8    4.34  8.55  1.95
##  7 d_9    9.10 17.7   3.93
##  8 d_10  10.2  19.9   4.40
##  9 d_11   7.22 13.5   3.13
## 10 d_12   3.70  7.02  1.63
## 11 d_13   5.96 11.5   2.64
## 12 d_15   6.73 13.1   3.16
## 13 d_16   4.42  8.50  2.02
## 14 d_17   5.51 10.9   2.66
## 15 d_18   5.56 10.6   2.46
## 16 d_19   5.64 10.8   2.60
## 17 d_20   6.30 12.2   2.96

4 Evaluation of the quality of the predictions and of their usefulness for epidemiology

To evaluate the quality of the predictions, we estimate, in the districts with registries, the gaps between predicted and observed CI. This is done with a cross-validation procedure, as follows: for each district \(j\) with a registry, predicted CI is obtained using the HC/I ratio estimated by the model (1) fitted after excluding district \(j\); the prediction error (PE) between the total number of predicted cases (\(\hat{I}_{j}=\sum_i \hat{I}_{ij}\)) and the total number of observed incident cases (\(I_{j}=\sum_i I_{ij}\)) is then computed as \(PE=\frac{\hat{I}_{j}-I_{j}}{I_{j}}\).

The epidemiological usefulness of the CI predictions is evaluated using two principles. First, predictions with large errors (i.e., large PE, see the previous paragraph) in districts with a cancer registry may be misleading and should be discarded. Second, predictions with errors of moderate size may still provide important information about the spatial variation of the cancer site if the between-district variation of CI is much larger than the errors (i.e., if the “signal-to-noise” trade-off remains favourable).

This signal-to-noise ratio is assessed by comparing the between-district variability of the HC/I ratio, which represents the “noise” or error, with the between-district variability of incidence, which represents the epidemiological signal. Both are measured in the districts with a cancer registry. The former is the standard deviation \(\sigma_d\) of the district random effect in the calibration model (1). The latter is the standard deviation \(\sigma_k\) of the district random effect in a Poisson mixed model fitted to the observed CI rates, with age introduced as a penalized regression spline.

The above key quantities (PE, and \(\sigma_d\) vs. \(\sigma_k\)) are then fed into a decision tree, depicted in Figure 1, for each cancer site and gender, to decide whether a given method (i.e., the use of a given proxy) provides predictions of CI of adequate quality. When the absolute value of PE (APE) is low for all districts with a registry (i.e., below 15%), predictions made with this method are considered valid (method rated A++). Conversely, if at least one district has a high APE (i.e., larger than 30%), or if more than 20% of the districts have intermediate errors (between 15% and 30%), the predictions are considered of insufficient quality (method rated B-). For intermediate-sized prediction errors (i.e., fewer than 20% of the districts with an APE between 15% and 30%), predictions may still be considered informative if the geographical variation in CI rates is larger than the error (i.e., \(\sigma_k>2\times \sigma_d\), which translates into a correlation between predicted and true incidence above 0.9) (method rated A+). Methods rated A++ or A+ are considered suitable to predict CI in all districts.
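The decision tree can be sketched as a small R function; rate_method is a hypothetical helper (not part of CalibInc), with the thresholds and ratings taken from the rule above:

```r
## Sketch of the decision rule: APE is the vector of absolute prediction
## errors (in %) over the districts with a registry; sigma_d and sigma_k
## are the standard deviations defined above
rate_method <- function(APE, sigma_d, sigma_k) {
  interm <- mean(APE > 15 & APE <= 30)  # share of districts with intermediate APE
  if (any(APE > 30) || interm > 0.20) return("B-")
  if (all(APE <= 15)) return("A++")
  if (sigma_k > 2 * sigma_d) "A+" else "B-"
}
rate_method(c(5, 8, 12), 0.02, 0.08)            # all APE below 15%: "A++"
rate_method(c(5, 18, 12, 9, 7, 6), 0.02, 0.08)  # one intermediate APE, favourable signal-to-noise: "A+"
rate_method(c(5, 35, 12), 0.18, 0.08)           # one APE above 30%: "B-"
```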

4.1 Predictions errors in cross-validation

For each district in the registry area, the calibration model is fitted without that district, and the total number of incident cases is then predicted for it. For pancreatic cancer in men, district “d_1” and proxy “A”, we use the following code:

m_panc_cv_d1_A<-gam(form,family="quasipoisson",knots=nk,
       data=dt_panc%>%filter(SRC=="A",dist!="d_1",N_I>0),method="REML")
cv_d1_A<-CalibInc(m_panc_cv_d1_A,pred=dt_panc%>%filter(SRC=="A",dist=="d_1"),aggregate=~1)
cv_d1_A
## # A tibble: 1 x 3
##    N_HC  pred    se
##   <dbl> <dbl> <dbl>
## 1   361  541.  34.2

The relative difference is then computed by comparing the prediction to the observed number of cases:

cv_d1_A%>%
  bind_cols(dt_panc%>%filter(SRC=="A",dist=="d_1")%>%summarise(N_I=sum(N_I)))%>%
  mutate(PE=100*(pred-N_I)/N_I)
## # A tibble: 1 x 5
##    N_HC  pred    se   N_I    PE
##   <dbl> <dbl> <dbl> <int> <dbl>
## 1   361  541.  34.2   519  4.21

These two steps are then repeated for each district in the registry area.

Code to compute complete cross-validation - pancreas, men

cv_panc<-dt_panc%>%
  group_by(dist,SRC)%>%
  nest()%>%
  mutate(cv=pmap(list(dist,SRC,data),
                 function(d,src,dt){
    m<-gam(form,family="quasipoisson",knots=nk,
           data=dt_panc%>%filter(SRC==src,dist!=d,N_I>0),method="REML")
    CalibInc(m,pred=dt%>%mutate(dist=d),aggregate=~1)
                 }))%>%
  unnest(cv)%>%
  select(-N_HC)%>%
  unnest(data)%>%
  group_by(dist,SRC)%>%
  summarise(pred=mean(pred),se=mean(se),N_I=sum(N_I),N_HC=sum(N_HC))%>%
  mutate(PE=100*(pred-N_I)/N_I)

Code to compute complete cross-validation - melanoma, women

cv_mela<-dt_mela%>%
  group_by(dist,SRC)%>%
  nest()%>%
  mutate(cv=pmap(list(dist,SRC,data),
                 function(d,src,dt){
    m<-gam(form,family="quasipoisson",knots=nk,
           data=dt_mela%>%filter(SRC==src,dist!=d,N_I>0),method="REML")
    CalibInc(m,pred=dt%>%mutate(dist=d),aggregate=~1)
                 }))%>%
  unnest(cv)%>%
  select(-N_HC)%>%
  unnest(data)%>%
  group_by(dist,SRC)%>%
  summarise(pred=mean(pred),se=mean(se),N_I=sum(N_I),N_HC=sum(N_HC))%>%
  mutate(PE=100*(pred-N_I)/N_I)

The results are stored in tables cv_panc and cv_mela for pancreas and melanoma cancers, respectively, printed below in tables 4.1 and 4.2.

For pancreatic cancer in men, prediction errors are relatively small for the HA/I method (all below 10% in absolute value), whereas for the other methods the errors are generally moderate but substantial in some districts (PE = 17.9% for district d_2 with the A/I method, 16.6% and -16.6% for districts d_14 and d_15 with H/I, and 19.1% for district d_20 with the M/I method).

Table 4.1: Detailed cross-validation in districts for pancreas, men
A/I | H/I | HA/I | M/I (for each method: columns N_HC, \(\hat{I}_j\), PE)
dist N_I N_HC \(\hat{I}_j\) PE N_HC \(\hat{I}_j\) PE N_HC \(\hat{I}_j\) PE N_HC \(\hat{I}_j\) PE
d_1 519 361 540.9 4.2 486 517 -0.4 597 530.3 2.2 513 554.4 6.8
d_2 291 230 343.1 17.9 256 270.8 -7 332 294.4 1.2 277 301.4 3.6
d_3 572 379 569.9 -0.4 556 595.9 4.2 657 585.7 2.4 542 570.7 -0.2
d_4 442 292 424.7 -3.9 385 408.5 -7.6 478 423.2 -4.3 391 413.3 -6.5
d_5 426 300 452.6 6.2 378 401.5 -5.7 459 407 -4.5 379 407.2 -4.4
d_6 633 399 586.4 -7.4 578 615 -2.8 719 638.3 0.8 625 677 6.9
d_7 947 674 996.4 5.2 947 1012.6 6.9 1113 991.4 4.7 816 873.9 -7.7
d_8 1089 729 1075.7 -1.2 961 1020.4 -6.3 1184 1049.3 -3.6 912 971.7 -10.8
d_9 972 681 1023.2 5.3 853 906 -6.8 1071 950.3 -2.2 843 907.2 -6.7
d_10 954 627 925.7 -3 838 888.2 -6.9 1050 929.6 -2.6 839 895.5 -6.1
d_11 410 278 418.4 2 385 410.6 0.1 461 409.5 -0.1 393 414.1 1
d_12 657 423 607.7 -7.5 661 705.1 7.3 771 684.7 4.2 599 646.5 -1.6
d_13 583 383 563.8 -3.3 550 586 0.5 641 567.5 -2.7 550 597.5 2.5
d_14 509 320 471.4 -7.4 552 593.7 16.6 614 547.7 7.6 494 525 3.2
d_15 300 215 333.8 11.3 235 250.1 -16.6 307 273.7 -8.8 269 280 -6.7
d_16 420 291 415 -1.2 443 471.8 12.3 515 456.4 8.7 428 471.6 12.3
d_17 362 228 344 -5 334 358 -1.1 384 341.5 -5.7 315 334.8 -7.5
d_18 553 373 547.1 -1.1 561 600.8 8.6 643 572.2 3.5 559 598.9 8.3
d_19 285 203 299.5 5.1 279 298.4 4.7 323 287.3 0.8 281 300.2 5.3
d_20 250 165 241 -3.6 241 257.8 3.1 287 255 2 278 297.7 19.1
Total 11174 7551 11180.5 17.9/4.6 10479 11168 16.6/6.5 12606 11195.1 8.8/3.1 10303 11038.7 19.1/6.6
Note: At the line “Total”, maximum and median absolute prediction error is reported in the PE column.

For melanoma, prediction errors are much larger with every method, with substantial errors in many districts (median APE above 10%) and very large errors in some (at least one district with an APE above 30% for each method).

Table 4.2: Detailed cross-validation in districts for melanoma, women
A/I | H/I | HA/I | M/I (for each method: columns N_HC, \(\hat{I}_j\), PE)
dist N_I N_HC \(\hat{I}_j\) PE N_HC \(\hat{I}_j\) PE N_HC \(\hat{I}_j\) PE N_HC \(\hat{I}_j\) PE
d_1 560 514 644.2 15 360 635.4 13.5 673 645.5 15.3 91 774.1 38.2
d_2 312 195 240.9 -22.8 160 280.1 -10.2 269 255 -18.3 30 269.6 -13.6
d_3 641 452 548.6 -14.4 427 735.6 14.8 661 617.6 -3.7 84 665.2 3.8
d_5 419 351 440.8 5.2 298 537.2 28.2 475 460.6 9.9 57 442.4 5.6
d_7 1352 765 935.6 -30.8 560 968.9 -28.3 1021 960.4 -29 127 1009.5 -25.3
d_8 689 555 688.6 -0.1 355 639.6 -7.2 713 692.4 0.5 116 1027.3 49.1
d_9 1336 924 1167 -12.6 792 1440.7 7.8 1306 1283.5 -3.9 107 856.8 -35.9
d_10 1471 1150 1436.7 -2.3 1015 1852.2 25.9 1625 1578.4 7.3 172 1400.7 -4.8
d_11 465 491 609.1 31 358 607.2 30.6 660 618.5 33 92 678.4 45.9
d_12 827 497 619.2 -25.1 256 432.9 -47.7 639 612.6 -25.9 85 647.2 -21.7
d_13 803 548 695.9 -13.3 349 605.4 -24.6 693 672.4 -16.3 85 652.7 -18.7
d_15 278 256 319.4 14.9 185 326.4 17.4 328 312.7 12.5 37 296.5 6.7
d_16 374 377 485.2 29.7 202 351.1 -6.1 445 434.8 16.2 59 540 44.4
d_17 303 239 296.8 -2 176 310 2.3 314 299 -1.3 40 256.6 -15.3
d_18 609 568 707.8 16.2 319 543.2 -10.8 678 638.2 4.8 88 723 18.7
d_19 306 226 279.1 -8.8 185 318.5 4.1 308 290.9 -4.9 35 280.8 -8.2
d_20 318 318 407.3 28.1 177 302.3 -4.9 366 352.1 10.7 30 246.1 -22.6
Total 11063 8426 10522.2 31/14.9 6174 10886.5 47.7/13.5 11174 10724.6 33/10.7 1335 10766.9 49.1/18.7
Note: At the line “Total”, maximum and median absolute prediction error is reported in the PE column.

The following section presents the code to draw graphical representations of the cross-validation results that are not shown in the article. This section is not essential and may be skipped by the reader.

Prediction errors in cross-validation may usefully be displayed as funnel plots, which allow a rapid examination of the extent of the errors given their precision (Spiegelhalter 2005). In these funnel plots, the precision of an error is defined as the inverse of its coefficient of variation, and 95% confidence limits under the null hypothesis of zero error are drawn.

Code for funnel plot - pancreas, men

## Define the 95% confidence limits given the range of the precisions (1/cv)
## of the errors (number of predicted cases follow a log-normal distribution).
level<-0.95
z <- qnorm(1 - (1 - level)/2)
cvs<-range(bind_rows(cv_panc,cv_mela)%>%mutate(cv=se/pred)%$%cv)
fun_bound<-tibble(cv=seq(cvs[1],cvs[2],length=100),pred=100,se=cv)%>%
  mutate(low = pred * sqrt(cv^2 + 1) * exp(-z * sqrt(log(cv^2 + 1))),
         up = pred * sqrt(cv^2 + 1) * exp(+z * sqrt(log(cv^2 + 1))))

## Plot the errors (point size proportional to the number of incident cases)
## and add confidence limits
cv_panc%>%
  ggplot(data=.)+
  geom_point(aes(PE,pred/se,size=N_I),alpha=0.2)+
  geom_line(data=fun_bound,aes(up-100,1/se))+
  geom_line(data=fun_bound,aes(low-100,1/se))+
  facet_wrap(~SRC)+
  coord_cartesian(xlim=c(-50,50))+
  theme(legend.position="bottom")+
  ylab("1/cv")

From the funnel plots for pancreatic cancer in men, it should first be noticed that all districts lie within, or very close to, the 95% confidence limits; thus, no district appears influential or outlying. Second, prediction errors are larger for the H/I and M/I methods (i.e., lower precision on the y-axis) than for A/I and HA/I. These results are in line with the estimates of table 2.1 (the between-district variability of the ratios, \(\sigma_d\), is higher for the H/I and M/I methods).

Figure 4.1: Funnel plot of prediction errors in cross-validation - pancreas, men

The funnel plots for melanoma in women show a very different pattern. Prediction errors are much larger for every method, and the precision is rather poor.

Figure 4.2: Funnel plot of prediction errors in cross-validation - melanoma, women

Another useful representation is to plot observed and predicted standardized incidence indicators. We may, for example, plot the observed and predicted SIR (along with their 95% PI) for each district, arranged in increasing order of SIR, in a so-called caterpillar plot. This type of representation gives a better appreciation of the effect of the errors on the indicators of interest.

Code for Caterpillar plot - pancreas, men

## Calculate expected number of cases in each district in the registry area
E_panc<-dt_panc%>%
  group_by(SRC)%>%
  nest()%>%
  mutate(E=pmap(list(SRC,data),
                 function(src,dt){
                   ## Get calibration model
                   m<-get(paste0("m_panc_",src))
                   ## Calculate incidence rates by age in the registry area
                   E<-dt%>%
                     group_by(age)%>%
                     mutate(w=1/sum(py))%>%
                     CalibInc(m,pred=.,aggregate=~age,weight=w)
                   ## Add district's py and sum py*rate by district
                   E%>%
                     left_join(dt%>%select(dist,age,py))%>%
                     group_by(dist)%>%
                     summarise(E=sum(pred*py))
                 }))%>%
  unnest(E)
## Add expected to predictions in cross-validation and 95% PI
SIR_panc_cv<-E_panc%>%left_join(cv_panc%>%select(SRC,dist,pred,se))%>%
  LogNormPI()
## Calculate the ratio
SIR_panc_cv<-SIR_panc_cv%>%mutate(pred=pred/E,se=se/E,low=low/E,up=up/E)

## Idem with observed incidence
SIR_panc_obs<-dt_panc%>%
  group_by(SRC,age)%>%
  mutate(tx=sum(N_I)/sum(py))%>%
  group_by(SRC,dist)%>%
  summarise(O_obs=sum(N_I),E_obs=sum(py*tx),SIR_obs=O_obs/E_obs)

## Plot the results
SIR_panc_cv<-SIR_panc_cv%>%
  left_join(SIR_panc_obs)%>%
  group_by(SRC)%>%
  arrange(SRC,SIR_obs)%>%
  mutate(rg=row_number())

SIR_panc_cv%>%
  ggplot(data=.)+
  geom_point(aes(rg-0.2,SIR_obs,size=O_obs,color="Obs."),alpha=0.3)+
  geom_pointrange(aes(rg+0.2,pred,ymin=low,ymax=up,color="Pred."))+
  facet_wrap(~SRC)+
  geom_hline(aes(yintercept=1))+
  xlab("District")+ylab("SIR")+
  labs(size="# I",colour="")+
  scale_x_continuous(breaks=unique(SIR_panc_cv$rg),labels=unique(SIR_panc_cv$dist),minor_breaks =NULL)+
  theme(axis.text.x=element_text(angle=90,hjust = 1,vjust=.5),legend.position="bottom",
        panel.grid.major.x = element_blank())

For pancreas, for example, we can see that the 17.9% prediction error in district d_2 with the A/I method (see table 4.1) results in a predicted SIR above 1.1 (i.e., an apparent excess of incidence in the district), whereas the observed SIR is in fact just below 1 (i.e., a slight incidence deficit in the district).

Figure 4.3: Caterpillar plot of SIR predicted in cross-validation - pancreas, men

For melanoma, it is clear from the plot that the errors lead to many diverging conclusions about which districts lie above or below the mean incidence level of the districts covered by registries.

Figure 4.4: Caterpillar plot of SIR predicted in cross-validation - melanoma, women

4.2 Comparison of the variability of the HC/I ratio with the geographical variability of incidence over the districts

The geographical variability of incidence over the districts is quantified by fitting a Poisson mixed model to the observed incidence rates, with age introduced as a continuous effect (modelled with a penalized regression spline) and district as a random intercept with standard deviation \(\sigma_k\).

The Poisson model for incidence rate has the following formulation

nk<-list(age=unique(dt_calib$age)) 
form_i<-N_I~offset(log(py))+s(age,bs="tp")+s(dist,bs="re") 

and is applied to the two cancer sites under evaluation.

m_i_panc<-gam(form_i,family="quasipoisson",method="REML",
              data=dt_calib%>%filter(site=="Pancreas-men",SRC=="H"))
m_i_mela<-gam(form_i,family="quasipoisson",method="REML",
              data=dt_calib%>%filter(site=="Melanoma-women",SRC=="H"))

Values of \(\sigma_k\) are again extracted from the models using the gam.vcomp function.

4.3 Decision rule

The quantities calculated in the validation steps detailed in sections 4.1 and 4.2 can finally be summarised by applying the decision rule, recalled below.

Figure 4.5: Decision rule for evaluating the cancer sites for which predictions of district-level incidence based on the HC/I ratio are valuable.

The criteria are summarised for each method and cancer site in the table below:

Table 4.3: Between district-variability of the HC/I ratios and geographical variability of I
(columns A to M first give the number of districts with APE>30% / APE>15% per method, then \(\sigma_k\) and, per method, \(\sigma_d\))
Cancer site # districts A H HA M \(\sigma_k\) A H HA M
Pancreas-men 20 0/1 0/2 0/0 0/1 0.08 0.02 0.06 0.02 0.06
Melanoma-women 17 2/8 2/7 1/7 5/11 0.18 0.18 0.21 0.15 0.17

For pancreas in men, as already noted, predictions with the HA/I method have small errors in all districts (below 15%), so the method is eligible for prediction and rated A++. With the A/I, M/I and H/I methods, predictions in respectively one, one and two districts have errors of intermediate size (between 15% and 30%), so these methods are not eligible on this criterion alone. However, looking at the signal-to-noise trade-off (\(\sigma_d\) vs. \(\sigma_k\)), the between-district variability of the A/I ratio remains small compared with the geographical variation of incidence (0.02 vs. 0.08), and the method is suitable for prediction (rated A+). For the H/I and M/I methods, on the contrary, the variability of the ratios is too high (0.06 and 0.06 vs. 0.08), and these methods are not suitable for prediction (rated B-). In this application, the chosen prediction method is HA/I.

For melanoma in women, the errors are too large with every method (APE > 30% in one to five districts, depending on the method), and all methods are rated B-. No valuable prediction of incidence can thus be made for this cancer site.

5 Prediction of incidence in all districts and disease mapping, pancreas-men

For pancreatic cancer in men, the previous sections showed that incidence can be predicted with the HA/I method with acceptable prediction errors. Useful predictions of incidence, as detailed in section 3, can thus be produced for all districts using the HA/I method.

In order to visualise the geographical gradients of predicted incidence, we may represent the standardized incidence ratios on a map.

5.1 Prediction of incidence and SIR calculation

The first step is to calculate the SIR as in section 3.2.1.

Code to compute SIR

dt_pred_panc<-dt_pred%>%filter(site=="Pancreas-men",SRC=="HA")

## Prediction of the total number of cases by district + 95% PI
P<-CalibInc(mod=m_panc_HA,pred=dt_pred_panc,aggregate=~dist)%>%
  LogNormPI()

## Expected number of cases by districts
## (make use of weight argument to calculate incidence rate by age)
E<-dt_pred_panc%>%
  group_by(age)%>%
  mutate(w=1/sum(py))%>%
  CalibInc(mod=m_panc_HA,pred=.,aggregate=~age,weight=w)%>%
  select(age,pred)%>%
  rename(rate=pred)%>%
  right_join(dt_pred_panc%>%select(dist,age,py))%>%
  group_by(dist)%>%
  summarise(E=sum(rate*py))

## Calculate SIR
SIR_panc<-P%>%left_join(E)%>%
  mutate(SIR=pred/E,low=low/E,up=up/E)

SIR_panc
## # A tibble: 95 x 8
##    dist   N_HC  pred    se   low    up     E   SIR
##    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 01      488  434.  26.5 0.836  1.06  461. 0.940
##  2 02      552  489.  28.4 1.02   1.28  429. 1.14 
##  3 03      399  356.  23.8 0.890  1.16  352. 1.01 
##  4 04      178  158.  15.3 0.808  1.18  163. 0.972
##  5 05      162  145.  14.6 0.904  1.34  132. 1.10 
##  6 06     1171 1045.  45.0 0.950  1.12 1012. 1.03 
##  7 07      326  290.  21.2 0.818  1.09  308. 0.942
##  8 08      237  210.  17.8 0.772  1.08  232. 0.908
##  9 09      149  133.  14.0 0.687  1.04  158. 0.840
## 10 10      330  294.  21.4 0.989  1.32  258. 1.14 
## # ... with 85 more rows

5.2 Graphical representation

The spatial coordinates of the 95 French administrative districts are available in the CalibInc package as a SpatialPolygons object: dep_fr_95. Note that Corsica (“Corse”) appears as a single geographical unit “20”: the two entities “2A” and “2B” are regrouped under this single unit in the data.

Mapping of the SIR may then be achieved using the utility function ggMap:

SIR_panc%>%rename(id=dist)%>%## id is the binding key between the spatial coordinates dep_fr_95 and the SIR_panc data
  ggMap(data=.,SIR,map=dep_fr_95,
        color="RdYlGn",rev=T,limits=c(0.7,1.3))

Figure 5.1: Maps of predicted standardized incidence ratio for pancreas cancer in men, 2007-2015.

5.3 Smoothing of SIR

5.3.1 Setting up BYM2 model

The Bayesian spatial model used to smooth the SIR is fitted with the INLA package, using the “bym2” prior.

To specify the “bym2” prior, we first need to create a matrix that specifies which districts share a common boundary (adjacency matrix). This is done by applying the nb2INLA function from the spdep package to the neighbouring structure of the French districts, extracted from dep_fr_95 thanks to the poly2nb function. The adjacency matrix is stored in the working directory (file “graph.inla”).

library(INLA)
dep_fr_95.nb<-spdep::poly2nb(dep_fr_95)
spdep::nb2INLA("graph.inla",dep_fr_95.nb)

The districts are sorted in “graph.inla” according to their order in dep_fr_95.nb (the same order as in the SIR_panc data). We therefore convert the dist variable into a rank variable so that the predictions match the adjacency matrix:

SIR_panc<-SIR_panc%>%
  mutate(dist_rank=as.numeric(as.factor(dist)))

Secondly, we need to specify that the predictions are log-normally distributed. To do so, we take advantage of the correspondence between the log-normal and normal distributions to convert our predictions to a gaussian scale (see (Chatignoux et al. 2019), Supplementary Material, section 3). Having a prediction distributed according to a log-normal distribution of mean \(\mu\) and standard error \(se\) is equivalent to having a log prediction following a normal distribution of variance \(s=\log(1+(se/\mu)^2)\) and mean \(\log(\mu)-s/2\); adding \(s/2\) to the log prediction thus gives an unbiased gaussian observation of \(\log(\mu)\).

We thus apply this transformation to our predictions, so we can use a gaussian distribution to fit the model.

SIR_panc<-SIR_panc%>%
  mutate(cv=se/pred,s=log(1+cv^2),lp=log(pred)+s/2)
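This log-normal/normal correspondence can be sanity-checked with a quick simulation. The values \(\mu=100\) and \(se=20\) below are arbitrary, chosen only for illustration:

```r
## Check: if log(X) ~ N(log(mu) - s/2, s) with s = log(1 + (se/mu)^2),
## then X is log-normal with mean mu and standard error se.
set.seed(1)
mu <- 100; se <- 20
s  <- log(1 + (se/mu)^2)
x  <- rlnorm(1e6, meanlog = log(mu) - s/2, sdlog = sqrt(s))
c(mean(x), sd(x))  ## close to c(100, 20)
```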

Thirdly, we have to specify a prior distribution for the hyperparameters \(\sigma\) and \(\phi\) of the model.

The uniform prior for \(\phi\) is easily specified with a Beta(1,1) distribution. For \(\sigma\), the parameters are specified in INLA in terms of precision, i.e. \(1/\sigma^2\). We used a gamma prior distribution for the precision, with parameters mimicking a uniform distribution for \(\sigma\) between 0.01 and 0.4 (the usual range of geographical variability previously observed for incidence in France). To do so, we empirically calculate the mean and variance of the precision corresponding to the uniform distribution for \(\sigma\), and convert them into the shape and rate of the gamma distribution.

This is done with the calc.prec function:

calc.prec<-function(a1,b1){
  ## Sample sigma from a uniform distribution on [a1, b1]
  sigma.v0<-runif(n=10000,min=a1,max=b1)
  ## Corresponding precisions
  tau.v<-1/sigma.v0^2
  ## Moment matching: shape (a2) and rate (b2) of the gamma distribution
  a2<-mean(tau.v)^2/var(tau.v)
  b2<-a2/mean(tau.v)
  c(a2,b2)
}
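The moment matching used here can be checked directly: by construction, the matched gamma mean \(a/b\) and variance \(a/b^2\) reproduce the empirical moments of the sampled precision exactly. A minimal self-contained check mirroring the function's internals:

```r
## Moment matching check: for a gamma distribution with shape a and
## rate b, mean = a/b and variance = a/b^2, so a = mean^2/var and
## b = a/mean recover the empirical moments of the precision exactly.
set.seed(42)
sigma <- runif(1e4, min = 0.01, max = 0.4)  # uniform draws for sigma
tau   <- 1/sigma^2                          # corresponding precisions
a <- mean(tau)^2/var(tau)                   # gamma shape
b <- a/mean(tau)                            # gamma rate
c(a/b - mean(tau), a/b^2 - var(tau))        # both differences are ~0
```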

We now have all the elements to write down the model’s formulae:

formBym2<-lp~offset(log(E))+
  f(dist_rank,model="bym2",graph="graph.inla", scale.model = TRUE,constr=T,
    hyper=list(
      phi=list(prior="logitbeta",param=c(1,1)),
      prec=list(prior="loggamma",param=calc.prec(0.01,0.4))
    ))

To fit the model, we lastly need to specify that the gaussian distribution has a fixed and known variance. This is done by tuning the control.family option (precision fixed to 1, i.e. initial log-precision of 0) and the scale argument of the inla function (set to the known precision 1/s):

fit_bym2<-inla(formBym2, family ="normal", data=SIR_panc,
               control.family = list(hyper =list(prec = list(fixed = TRUE, initial = 0))),
               scale=1/s,control.predictor=list(link=1,compute=T),
               control.compute=list(cpo=TRUE,dic=T,config=T,waic = TRUE))

5.3.2 Extract model parameters

The posterior estimates of the model key parameters \(\sigma\) and \(\phi\) are given as results of the inla function:

fit_bym2$summary.hyperpar
##                                mean         sd  0.025quant    0.5quant  0.975quant        mode
## Precision for dist_rank 287.4950051 86.2909937 156.3806279 274.6377099 491.6340541 250.8267540
## Phi for dist_rank         0.8426059  0.1239109   0.5258022   0.8744807   0.9874378   0.9632855

For \(\sigma\), the posterior is given in terms of precision (\(1/\sigma^2\)). To convert it to the standard error scale, we sample from the posterior distribution of the precision and transform the draws:

post_prec<-fit_bym2$marginals.hyperpar$"Precision for dist_rank"
post_sd<-inla.rmarginal(1000,inla.tmarginal(function(x) 1/sqrt(x), post_prec))
c(mean(post_sd),quantile(post_sd,p=c(0.025,0.975)))
##                  2.5%      97.5% 
## 0.06085827 0.04556318 0.07943903

The estimated variability \(\sigma\) of the SIR across all French districts (0.06 [0.05;0.08]) is close to the variability estimated on the registry districts only (\(\sigma_k\) = 0.08). Most of this variation is explained by the spatially structured component (\(\phi\) = 0.84 [0.53;0.98]).

5.3.3 Get smoothed predictions

Posterior estimates of the predicted number of cases are retrieved by taking the exponential of the model’s fitted values. This can be done with the following function:

get_pred<-function(model){
  ## Back-transform the posterior marginals of the fitted values to the natural scale
  exp.marg<-lapply(model$marginals.fitted.values,function(m) inla.tmarginal(function(x) exp(x), m))
  ## Summarise each marginal (mean, sd and quantiles)
  exp.marg<-t(sapply(exp.marg,function(em) inla.zmarginal(em,silent = TRUE)%>%do.call("c",.)))
  ## Keep the mean, sd and 95% bounds (2.5% and 97.5% quantiles)
  tibble(fit=exp.marg[,1],se.fit=exp.marg[,2],fit.low=exp.marg[,3],fit.up=exp.marg[,7])
}

We apply the function to the model and add the smoothed predictions to the original data:

SIR_panc<-SIR_panc%>%bind_cols(get_pred(fit_bym2))

5.3.4 Map of smoothed SIR

ggMap can again be used to map the smoothed SIR:

SIR_panc%>%rename(id=dist)%>%
  mutate(SIR_smooth=fit/E)%>%
  ggMap(data=.,SIR_smooth,map=dep_fr_95,
        legend = list(title="Smoothed\nSIR"),
        color="RdYlGn",rev=T,limits=c(0.7,1.3))

Figure 5.2: Maps of predicted standardized incidence ratio for pancreas cancer in men, 2007-2015.

5.3.5 Comparison between smoothed and crude SIR

Compared to the map of “crude” SIR (figure 5.1), the map of smoothed SIR looks more homogeneous, showing a west-east gradient in the incidence of pancreatic cancer in men.

Given the large amount of variation explained by the spatially structured component of the model, predicted incidence in districts with atypical SIR compared to their neighbours has been shrunk toward the mean of the neighbours. For example, in the north-east of France, the map of crude SIR showed districts with low incidence (SIR<1) sharing a boundary with districts of high incidence (SIR>1). In the smoothed map, these diverging patterns largely disappear, providing a more realistic picture of the underlying risk surface.

It is important to recall that the amount of smoothing is higher for districts with uncertain crude predictions. This may be seen in figure 5.3, which represents crude and smoothed SIR on a funnel plot (see section 4.1). In this funnel plot, each point represents a district, crude SIR being represented as red dots and smoothed SIR as green triangles. A line joins the crude and smoothed estimates for the same district. The most uncertain crude SIR (at the bottom of the graph) clearly undergo a larger amount of shrinkage than the others.

Code for the graph on the effects of smoothing on point and precision estimates - Funnel plot - pancreas, men

dt_fun_SIR_panc<-SIR_panc%>%
  mutate(SIR_smooth=fit/E)%>%
  rename(SIR_crude=SIR,low_crude=low,up_crude=up,cv_crude=cv)%>%
  mutate(low_smooth=fit.low/E,up_smooth=fit.up/E,cv_smooth=se.fit/fit)

level<-0.95
z <- qnorm(1 - (1 - level)/2)
cvs<-range(dt_fun_SIR_panc$cv_crude,dt_fun_SIR_panc$cv_smooth)*c(0.9,1.2)
fun_bound<-tibble(cv=seq(cvs[1],cvs[2],length=100),pred=1,se=cv)%>%
  mutate(low = pred * sqrt(cv^2 + 1) * exp(-z * sqrt(log(cv^2 + 1))),
         up = pred * sqrt(cv^2 + 1) * exp(+z * sqrt(log(cv^2 + 1))))


dt_fun_SIR_panc%>%
  gather(var,val,SIR_smooth,SIR_crude,low_crude,up_crude,low_smooth,up_smooth,cv_smooth,cv_crude)%>%
  separate(var,c("var","type_est"))%>%
  spread(var,val)%>%
  ggplot()+
  geom_point(aes(SIR,1/cv,colour=type_est,shape=type_est),size=2)+
  geom_segment(data=dt_fun_SIR_panc,
               aes(y=1/cv_crude,yend=1/cv_smooth,x=SIR_crude,xend=SIR_smooth),colour="grey",size=.3)+
  geom_line(data=fun_bound,aes(up,1/se))+
  geom_line(data=fun_bound,aes(low,1/se))+
  labs(colour="Prediction type",shape="Prediction type")+
  theme(legend.position="bottom")

Figure 5.3: Effects of smoothing on point and precision estimates - Funnel plot - pancreas, men

The preceding plot also shows that the precision of the smoothed SIR is higher than that of the crude SIR. Indeed, as smoothing borrows information from all the districts to consolidate the estimation of the SIR in each district, point estimates are less variable. This is better seen in caterpillar plots comparing crude and smoothed estimates (figure 5.4).

Code for the caterpillar plot of crude and smoothed SIR - pancreas, men

SIR_panc%>%
  mutate(SIR_smooth=fit/E)%>%
  rename(SIR_crude=SIR,low_crude=low,up_crude=up)%>%
  mutate(low_smooth=fit.low/E,up_smooth=fit.up/E)%>%
  gather(var,val,SIR_smooth,SIR_crude,low_crude,up_crude,low_smooth,up_smooth)%>%
  separate(var,c("var","type_est"))%>%
  spread(var,val)%>%
  group_by(type_est)%>%
  arrange(SIR)%>%
  mutate(id=row_number())%>%
  qplot(data=.,id,SIR)+geom_linerange(aes(ymin=low,ymax=up))+
  facet_wrap(~type_est)+
  geom_hline(yintercept=1)+
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  xlab("Districts")

Figure 5.4: Caterpillar plot of crude and smoothed SIR - pancreas, men, 2007-2015

5.4 National predictions

Predictions of incidence at the national level are obtained by specifying an intercept as the aggregation level (aggregate=~1). To obtain the annual mean number of incident cases over the 2007-2015 period, we furthermore pass a 1/8 weight to the CalibInc function.

Whatever the proxy used, the predictions are close to each other (maximal difference of 6%), even though M and H methods were not suitable proxies for district level predictions. It should however be noted that, given the higher variability of the H/I and M/I ratios, national predictions using H and M proxies are less precise than predictions using HA or A proxies.

dt_panc_HA<-dt_pred%>%filter(site=="Pancreas-men",SRC=="HA")
dt_panc_H<-dt_pred%>%filter(site=="Pancreas-men",SRC=="H")
dt_panc_A<-dt_pred%>%filter(site=="Pancreas-men",SRC=="A")
dt_panc_M<-dt_pred%>%filter(site=="Pancreas-men",SRC=="M")

## Prediction of total number of cases by districts + IP
CalibInc(mod=m_panc_HA,pred=dt_panc_HA,aggregate=~1,weight=1/8)%>%
  LogNormPI()
## # A tibble: 1 x 5
##    N_HC  pred    se   low    up
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7134. 6334.  68.6 6202. 6470.
CalibInc(mod=m_panc_H,pred=dt_panc_H,aggregate=~1,weight=1/8)%>%
  LogNormPI()
## # A tibble: 1 x 5
##    N_HC  pred    se   low    up
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6037. 6433.  122. 6200. 6677.
CalibInc(mod=m_panc_A,pred=dt_panc_A,aggregate=~1,weight=1/8)%>%
  LogNormPI()
## # A tibble: 1 x 5
##    N_HC  pred    se   low    up
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4107. 6085.  78.6 5933. 6241.
CalibInc(mod=m_panc_M,pred=dt_panc_M,aggregate=~1,weight=1/8)%>%
  LogNormPI()
## # A tibble: 1 x 5
##    N_HC  pred    se   low    up
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5662. 6066.  116. 5844. 6298.

6 References

Chatignoux, Édouard, Laurent Remontet, Jean Iwaz, Marc Colonna, and Zoé Uhry. 2019. “For a sound use of health care data in epidemiology: evaluation of a calibration model for count data with application to prediction of cancer incidence in areas without cancer registry.” Biostatistics (Oxford, England) 20 (3): 452–67. https://doi.org/10.1093/biostatistics/kxy012.

Spiegelhalter, David J. 2005. “Funnel plots for comparing institutional performance.” Statistics in Medicine 24 (8): 1185–1202. https://doi.org/10.1002/sim.1970.